(54) Method and system for managing data service systems



(57) A scheme is described for a data service system (50)
having a number of modules (51, 54-56, 61-67). Some of the
modules (51, 54-56, 61-67) are interdependent. To measure
the status of an individual module, the scheme first
collects measurements from a number of measurement routes
(110-116) that involve the module. Then the scheme analyzes
the interdependencies of the measurements to determine the
status of the individual module. The scheme may also
determine the status of the data service system (50) with a
minimal number of measurement routes. This is done by (1)
determining all possible measurement routes, (2)
determining the dependency between the modules and the
measurement routes, and (3) analyzing the dependency to
select a minimal number of the measurement routes. The
scheme can diagnose whether a module is a problematic
module or not by analyzing a number of measurements that
involve the module. If one of the measurements is good, the
module is identified as non-problematic.



FIG.7 
Description

The present invention pertains to data service sys- 
tems. More particularly, this invention relates to a sys- 
tem and method for providing integrated management 
of the availability and performance of a data service sys- 
tem having interdependent functional modules. 

Figure 1 shows a prior art Internet or Intranet access
system 10. As can be seen from Figure 1, the access
system 10 typically includes an Internet/Intranet service
system (ISS) 20 and an interconnect network 14 that
connects the ISS 20 to subscriber sites (Figure 1 only
shows one such site 11). Subscribers connect directly
to the ISS 20 from their terminals (e.g., personal
computers, Macintoshes, Web terminals, etc.). A modem 13
may be used to connect the user terminal 12 to the ISS
20.

The ISS 20 typically includes web content servers 
24 that store data for access from the subscriber sites. 
The content servers 24 typically support servers for In- 
ternet applications, such as electronic mail, bulletin 
boards, news groups, and World Wide Web access. In 
addition, the ISS 20 may have web proxy servers 25 that
allow a network administrator to restrict access to the 
global Internet 15 or other ISSs 16. Another use of the 
proxy servers 25 is to cache frequently accessed data 
from the Internet. The ISS 20 may also include address 
assignment servers 22 and a network address translator 
27. The address assignment servers 22 assign an address
to a user terminal when it is first connected to the
ISS 20. The assigned address uniquely identifies the 
terminal in the ISS 20. The network address translator 
27 is used when the ISS 20 uses different addresses for 
communication within the system 20 and for communi- 
cation outside the system 20. 

Subscribers in the ISS 20 usually refer to servers in 
the ISS 20, in the global Internet 15, and other ISSs 16, 
by their host names. However, routing of packets to and 
from the servers is based on network addresses assigned
to the servers rather than the host names. In the
ISS 20, Domain name servers (DNS) 23 are used to 
translate subscriber references to host names into
network addresses of the servers. The DNS 23 may
themselves rely on other DNS servers in the global Internet
15 and other ISSs 16 to determine the host name to
network address mappings.

Other components or modules that are typical of the 
ISS are a firewall 26 that controls access to and from 
the system 20, and a router or routers 21 for routing
transmissions to and from subscribers, and to and from 
the global Internet 15 and other ISSs 16. 

Data transfer between the ISS 20 and the subscrib- 
er site 11 is provided by the interconnect network 14. 
The network 14 can use a number of technologies sup- 
porting a wide range of bandwidths. 

In the ISS 20, the Internet Protocol (IP) is typically
used for data communication to and from the various
access servers 22-27, as well as with the global Internet
15 and other ISSs 16. The Transmission Control Protocol
(TCP), which operates above the IP layer and ensures
reliable delivery of information to and from the access
servers, is used for reliable access to the web and proxy
servers in the ISS 20, the global Internet 15, and other
ISSs 16. The application protocols used above the TCP
layer are specific to the applications being accessed by
subscribers. For example, the File Transfer Protocol
(FTP) is used for file transfers and the Hyper Text Transfer
Protocol (HTTP) is used for web accesses.

Management of a data service system (e.g., ISS 20)
typically includes the following functions: (1) monitoring 
the availability and performance of the system; and (2) 
diagnosing the availability and performance problems 
that occur during the operation of the system. Of course, 
the management may include other functions. 

Prior art testing and measurement tools have been
developed to enable management of the individual
functional modules of the ISS. For example, a prior art
Multi Router Traffic Grapher (MRTG) testing tool enables the
forwarding rate of a router over time to be observed.
Another prior art testing tool, PerfView (made by
Hewlett-Packard Co. of Palo Alto, California), can monitor CPU,
disk, and memory utilization on a specific host system.
The resource consumption of specific functional modules
(e.g., web content servers) can also be monitored
using PerfView. Tools for monitoring the performance of
web content servers have also been developed, e.g., the
public domain timeit tool. Moreover, many testing tools
have also been developed to measure the performance
of the interconnect network. These tools include Netperf
and throughput TCP (ttcp).

One drawback of the prior art testing technologies
is that they only measure the status (i.e., availability and
performance) of individual modules without taking into
consideration the services the system provides. To
assess the status of the ISS 20 and the various services
offered by the ISS 20 (e.g., news, FTP, Web access,
etc.), a network operator has to manually check (i.e.,
individually test) each module of the ISS to determine
the module's status and then correlate the status of all
of the modules to figure out where the problem is in the
ISS 20. This requires that the network operator
understand not only the interconnections between the
different modules of the ISS 20 but also the logical
interdependencies between these modules. The difficulty of
this process is illustrated by way of an example in Figure
2, which shows a user accessing a web site at the
global Internet 15 via the proxy server 25. Although the
data transfer occurs from the Internet web site via the
proxy server 25, the actual access requires the access
of the DNS 23 to obtain the IP address of that particular
web site. Only after the IP address of the web site has
been determined can the proxy server 25 access the
global Internet 15. To assess the status of the ISS 20,
the operator has not only to check the network routes
between the proxy server 25 and the global Internet 15
and between the proxy server 25 and the user terminal
12, but also to be aware of the dependency of the web
service on the DNS server 23 and test the DNS server as
well. If the web operator only checks the modules along
the network route, the problem module (i.e., DNS 23)
cannot be identified. The diagnosis of availability and
performance problems in the ISS 20 gets more complex
as the number of modules in the ISS 20 increases.
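
The Figure 2 example can be sketched in a few lines. This is only an illustration, not part of the described system: the module labels, the `SERVICE_DEPENDENCIES` table, and the health snapshot are all hypothetical. It shows why testing only the modules on the network route misses a failed DNS.

```python
# Hypothetical module labels; the numerals follow Figure 2 of the text.
NETWORK_ROUTE = ["user terminal 12", "proxy server 25", "global Internet 15"]

# Assumed logical (service) dependencies that are not on the network route.
SERVICE_DEPENDENCIES = {"proxy server 25": ["DNS 23"]}


def modules_to_check(route, dependencies):
    """Return the route modules plus every module they logically depend on."""
    to_check = []
    for module in route:
        if module not in to_check:
            to_check.append(module)
        for dep in dependencies.get(module, []):
            if dep not in to_check:
                to_check.append(dep)
    return to_check


def failed_modules(modules, health):
    """List the modules whose health entry is False."""
    return [m for m in modules if not health[m]]


# Assumed health snapshot: only the DNS has failed.
health = {
    "user terminal 12": True,
    "proxy server 25": True,
    "global Internet 15": True,
    "DNS 23": False,
}

route_only = failed_modules(NETWORK_ROUTE, health)
with_deps = failed_modules(
    modules_to_check(NETWORK_ROUTE, SERVICE_DEPENDENCIES), health
)
```

Checking the route alone reports no failure, while the dependency-aware check reports the DNS.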

Another drawback is that since modules are meas- 
ured in isolation, their measurements do not assess the 
availability and performance of the modules as the
availability and performance relate to the service being
provided using the modules. For example, the performance
measurements for a firewall measured in isolation are 
CPU utilization at the firewall, packet handling rate (i.e., 
packets per second) of the firewall, the delay introduced 
by the firewall in routing a packet, and the number of 
packets discarded by the firewall per second because 
of buffer overflows. However, the impact of the firewall 
performance on the ISS system depends on several fac- 
tors that are external to the firewall. These factors in- 
clude the specific TCP (Transmission Control Protocol), 
the TCP/IP stack used in users' terminal and in the web 
content and proxy servers, the TCP window size used 
by the web browser and web server application mod- 
ules, and the size of the data transfer. Furthermore, the 
location of the firewall in the topology of the ISS system 
determines which of the functional modules the firewall 
impacts. Therefore, the precise impact of the firewall 
performance on the performance of the ISS system can- 
not be measured by considering the firewall in isolation. 

One feature of the present invention is to obtain sta- 
tus (i.e., availability and performance) information of a 
data service system having interdependent modules 
with a minimal number of measurements performed on 
the system. 

Another feature of the present invention is to deter- 
mine the status of a module of a data service system 
having a number of interdependent modules. 

A further feature of the present invention is to iden- 
tify one or more problem modules in a data service sys- 
tem having interdependent modules. 

Described below is a scheme for a data service system
having a number of modules some of which are
interdependent. To measure the status of an individual
module, the scheme first collects measurements from 
at least one measurement route that involves the mod- 
ule. Then the scheme analyzes the interdependencies 
of the measurements to determine the status of the in- 
dividual module. 

The scheme can also determine status of the data 
service system with a minimal number of measure- 
ments. The scheme first determines all possible meas- 
urement routes for measuring all of the modules based 
on a predetermined topology of the system. The scheme 
then determines the dependencies between the mod- 
ules and the measurement routes. The dependencies 
are then analyzed to select a minimal number of
measurement routes that involve all of the modules to
determine the status of the system.
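
Choosing a small set of routes that together involve every module is a set-cover problem. The sketch below uses the standard greedy set-cover heuristic as a stand-in. It is not the selection procedure of Figures 10A-10D, which is not reproduced in this passage, and the route names and module sets are hypothetical. The greedy choice yields a small cover but does not guarantee the true minimum.

```python
def select_routes(routes):
    """Greedy set cover: pick routes until every module is covered.

    `routes` maps a route name to the set of modules that route involves.
    """
    uncovered = set().union(*routes.values())
    chosen = []
    while uncovered:
        # Pick the route covering the most still-uncovered modules;
        # iterating in sorted order makes tie-breaking deterministic.
        best = max(sorted(routes), key=lambda r: len(routes[r] & uncovered))
        chosen.append(best)
        uncovered -= routes[best]
    return chosen


# Hypothetical dependency table (route -> modules involved).
routes = {
    "route_a": {"interconnect", "router", "proxy", "dns"},
    "route_b": {"interconnect", "router", "web"},
    "route_c": {"router", "proxy", "firewall", "internet"},
    "route_d": {"router", "dns"},
}
selected = select_routes(routes)
```

In this example `route_d` is never selected, because the modules it involves are already covered by `route_a`.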

Moreover, the scheme can diagnose whether a
module of the data service system is a problematic
module or not. This can be done by first analyzing a number
of measurements that involve the module. If one of the
measurements is good, then the scheme identifies the
module as a non-problematic module. If one of the
measurements that only involves the module is bad, then the
scheme identifies the module as problematic.
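
The two diagnosis rules above can be written as a short function. This is a minimal sketch under stated assumptions: a measurement is modeled as a (modules involved, good/bad) pair, the sample data is hypothetical, and the "undetermined" outcome is an assumption added for the case that neither rule covers.

```python
def diagnose(module, measurements):
    """Classify `module` from (modules_involved, is_good) measurement pairs.

    Returns "non-problematic", "problematic", or "undetermined".
    """
    relevant = [(mods, good) for mods, good in measurements if module in mods]
    # Rule 1: any good measurement through the module clears it.
    if any(good for _, good in relevant):
        return "non-problematic"
    # Rule 2: a bad measurement that involves only this module blames it.
    if any(not good and set(mods) == {module} for mods, good in relevant):
        return "problematic"
    # Neither rule applies (an assumption; the text does not name this case).
    return "undetermined"


# Hypothetical measurement results.
measurements = [
    ({"router", "dns"}, True),
    ({"router", "proxy"}, False),
    ({"proxy"}, False),
    ({"firewall", "internet"}, False),
]
verdicts = {m: diagnose(m, measurements) for m in ("router", "proxy", "firewall")}
```

Here the router is cleared by the good router/DNS measurement, the proxy is blamed by the bad proxy-only measurement, and the firewall stays undetermined until a further measurement separates it from the Internet link.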

Further, at least one test target is coupled to a
module in the data service system. The test target has a
predetermined function and performance status. The test
target allows service test signals to be measured
through the module.

Other features and advantages of the present
invention will become apparent from the following detailed
description, taken in conjunction with the accompanying 
drawings, illustrating by way of example the principles 
of the invention. 




Figure 1 shows a prior art data access network system
that includes a data service system;

Figure 2 shows an example of an actual access route
within the prior art data access network system of
Figure 1 from the user terminal to the global Internet;

Figure 3 shows a data service system in a data access
network system, wherein the data service system includes
a measurement system that implements one embodiment of
the present invention;

Figure 4 illustrates monitors of the measurement system
of Figure 3 that make measurements at various modules;

Figures 5A through 5D show different implementations of
the monitors;

Figure 6 shows the service topology of the data service
system of Figure 3 within the data access network system;

Figure 7 shows the dependency graph between measurement
routes and the interconnect network, the global Internet,
the other ISSs, and the modules of the data service system
based on the service topology of the data service system;

Figure 8 shows three different measurement routes for the
data service system;

Figures 9A and 9B show the measurements obtained from two
of the measurement routes shown in Figure 8, wherein
Figures 9A and 9B show the measurements for two different
but overlapping time periods;

Figure 9C shows the measurements obtained from all of the
measurement routes shown in Figure 8 for the same time
period as shown in Figure 9B;

Figures 10A through 10D show in flow chart diagram form
the scheme of determining the minimum number of
measurement routes within which the status of the data
service system can be measured;

Figures 11A and 11B show the process of finding out the
minimum number of measurement routes from the dependency
graph of Figure 7 using the scheme of Figures 10A-10D;

Figures 12A through 12C show a flow chart diagram of a
rating process for assigning ratings for each module of
the data service system based on its measurements to
obtain status of each module;

Figures 13A and 13B show a flow chart diagram of a rating
algorithm used by the rating process of Figures 12A-12C;

Figures 14A and 14B show a flow chart diagram of a
correlation algorithm used by the rating process of
Figures 12A-12C;

Figure 15 shows the display of the measurement system,
illustrating the status of different modules measured by
the measurement system;

Figure 16 shows the process for isolating or identifying
the problematic modules;

Figure 17 shows an example of using the test target to
provide an extra measurement route.

Figure 3 shows a data access network system 50 
that includes a data service system 60. The data service 
system 60 can be employed by Internet/Intranet service 
providers to offer web services to users or subscribers. 
Thus the data service system 60 can be also referred to 
as an Internet/Intranet service system (i.e., ISS). The
ISS 60 includes a measurement system 70 that meas- 
ures and monitors the availability and performance of 
the services offered by the ISS 60 to the users in ac- 
cordance with one embodiment of the present invention, 
which will be described in more detail below. 

In addition to the ISS 60, the data access network 
system 50 also includes an interconnect network 54 and 
links to a global Internet 55 and other ISSs 56. The other
ISSs 56 may include online service systems, such as 
America Online and Compuserve, local Internet service 
systems, and/or Intranet service systems. From the us- 
er's point of view, access to web content servers in the 
global Internet 55 and the other ISSs 56 can be regarded 
as two other services offered by the system 60. In addi- 
tion, the interconnect network 54 can also be regarded 
as a module of the system 60 from the user's point of 
view. The ISS 60 provides Internet or Intranet service to 
subscriber sites via an interconnect network 54. Figure 
3 only shows one such site 51. The subscriber sites may
be at the residences, schools, or offices of the subscrib- 
ers/users. 

The interconnect network 54 can be any known
network. In one embodiment, the interconnect network 54
is a LAN (Local Area Network) or WAN (Wide Area
Network). In other embodiments, the network 54
can be an Ethernet, an ISDN (Integrated Services Digital
Network) network, an ADSL network, a T-1 or T-3 link,
a cable or wireless LMDS network, a telephone line
network, or an FDDI (Fiber Distributed Data Interface)
network. Alternatively, the interconnect network 54 can be
another known network.



The subscriber site 51 includes a terminal 52 and a 
modem 53. The terminal 52 can be a personal computer, 
a network computer, a notebook computer, a
workstation, a mainframe computer, a supercomputer, or any
other type of data processing system. The modem 53 is
optional and may be replaced with a network adapter, 
depending on the network technology adopted for the 
interconnect network 54. 

The ISS 60 includes a router 61 for routing data to
and from the remote sites. The router 61 functions to
connect the remote sites to the appropriate on-premises
servers 64 through 67, or to the global Internet 55 or the
other ISSs 56 via a firewall 63. The router 61 may
operate in the Asynchronous Transfer Mode (ATM) to
provide high bandwidth packet-based switching and
multiplexing. The router 61 may include a number of separate
routers and/or switching devices.

The servers in the ISS 60 include web content
servers 66, proxy servers 67, Domain Name Servers (DNSs)
65, address assignment servers (e.g., Dynamic Host
Configuration Protocol, DHCP servers) 64, and network
address translators 62. In addition, the ISS 60 may
include other servers (not shown).

The content servers 66 store Web contents that
include HTML pages, gif images, video clips, etc. Data
transfers to and from the content servers 66 are enabled
by transport protocols such as the Transmission Control
Protocol (TCP) and the User Datagram Protocol (UDP). The
content servers 66 can support a variety of Internet
applications to provide services such as access to the
World Wide Web, electronic mail, bulletin boards, chat
rooms, and news groups.

The proxy servers 67 may be used to enhance
security of accesses to and from the remote site 51, as
well as to speed up Internet access by caching
frequently accessed data locally. The caching function of the
proxy servers 67 improves the response time that users
perceive for web accesses. Caching also reduces the
number of accesses to the global Internet 55. The
security feature of the proxy servers 67 allows the web
operator to limit the web sites that users can access. This
is done by ensuring all user accesses pass through the
proxy servers 67. Better security can also be provided
by allowing only the proxy servers 67 to have direct
access to the global Internet 55.

The address assignment servers 64 assign an
address to a user terminal when it first accesses the ISS
60. The assigned address uniquely identifies the
terminal in the system 60. The address can be determined
statically (i.e., at the time the user subscribes to the
system 60), or dynamically (i.e., the user terminal may get
a different address each time it is connected to the ISS
60). Specialized address assignment protocols such as
Dynamic Host Configuration Protocol (DHCP) are used
by the address assignment servers to assign addresses
to user terminals.

The DNS servers 65 provide mapping between host
names and network addresses. This process is referred
to as name resolution. Before accessing a web content
server, the subscriber's web browser application first
contacts one of the DNS servers 65 to find out the
network address of the web content server. The DNS server
may in turn rely on other DNS servers in the ISS 60, in
the global Internet 55, or in other ISSs 56 to assist in the
name resolution. If the resolution fails, the web transfer
is aborted. To minimize the time for the name resolution,
the DNS servers 65 maintain local caches of the most
recently requested host name-to-IP address mappings.
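
The caching behavior described above can be sketched as a resolver that keeps the most recently requested mappings and asks an upstream resolver only on a miss. This is a hedged illustration, not the DNS servers' actual implementation: the `CachingResolver` class, the cache capacity, and the upstream table are assumptions, and real DNS caches also honor record lifetimes, which are omitted here.

```python
from collections import OrderedDict


class CachingResolver:
    """Name resolver with a small cache of recently requested mappings."""

    def __init__(self, upstream, capacity=2):
        self.upstream = upstream        # callable: host name -> address or None
        self.capacity = capacity
        self.cache = OrderedDict()      # host name -> address, oldest first
        self.upstream_queries = 0

    def resolve(self, host):
        if host in self.cache:
            self.cache.move_to_end(host)    # mark as most recently requested
            return self.cache[host]
        self.upstream_queries += 1
        address = self.upstream(host)
        if address is None:
            return None                     # resolution failed; caller aborts
        self.cache[host] = address
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict least recently requested
        return address


# Hypothetical upstream table standing in for other DNS servers.
TABLE = {"www.example.com": "192.0.2.10", "news.example.com": "192.0.2.20"}
resolver = CachingResolver(TABLE.get)
first = resolver.resolve("www.example.com")
second = resolver.resolve("www.example.com")    # served from the cache
missing = resolver.resolve("nosuch.example.com")
```

The second lookup does not reach the upstream resolver, and the failed lookup returns `None` so that the caller can abort the transfer.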

The firewall 63 controls access to and from the ISS 
60 from the global Internet 55 and other ISSs 56. Fire- 
walls may be implemented in hardware or software and 
are included to enhance security of the ISS 60. Exam- 
ples of known firewall technologies are packet-level fil- 
ters and application-level gateways (e.g., socks). 

The network address translator 62 is used to trans- 
late between a "private" address and a "public" address. 
The "private" address is used for a user to access
servers within the ISS 60 and the "public" address is used
for the user to access web sites in the global Internet 55 
or other ISSs 56 that are outside the ISS 60. In cases 
when user terminals are assigned "public" addresses di- 
rectly, the ISS does not include the network address 
translator 62. 

One of the methods for testing the performance of 
a service supported by the ISS 60 is to generate test 
traffic or signals that emulate the transmission of data 
when subscribers access the service. For example, to 
assess the performance of web accesses that use the 
Hyper Text Transfer Protocol (HTTP), HTTP-based test 
traffic is transmitted to or from the modules being tested. 
While the server modules such as web content servers 
66 and proxy servers 67 respond to such test traffic, the 
networking modules such as the interconnect network 
54, the firewall 63, and the router 61 do not respond to 
such test traffic, thereby making it difficult to assess the 
impact of these modules on the performance of services 
supported by the ISS 60. To facilitate testing of these 
networking modules, the ISS 60 also includes test points 
or test targets connected to some of the components of 
the system 60. Figure 3 shows one test target 72 asso- 
ciated with the firewall 63. Alternatively, the test target 
72 may be connected to the firewall via other modules. 
Figure 3 also shows a test target 71 associated with the 
interconnect network 54. In practice, many more test tar- 
gets may be used. In addition, the test targets used are 
not limited to just the networking modules. The test
targets can also be used to test the functional modules. A
test target can also be associated with the measurement 
system 70 to initiate test traffic to other functional mod- 
ules and/or test targets. 

Like a subscriber terminal, a test target is also as- 
signed a network address and responds to test packets 
directed to it. Since the test targets are installed in the 
ISS 60 solely to assist in measuring and monitoring of 
services in the ISS 60, they are guaranteed to be avail- 
able for testing at all times (unlike subscriber terminals). 
Moreover, since the performance of the test targets is
known a priori, the test targets serve as reliable end
points for testing against to evaluate the performance of
different networking modules of the ISS 60.

To support testing of the performance of different
services, a test target may also emulate the functions of
specific servers of the ISS 60. For example, a test target
may host a web server and can respond to HTTP traffic
directed to it. The test targets are connected to
networking modules in the ISS 60 to allow service test traffic
(e.g., HTTP test traffic) to be measured through those
networking modules. In this case, because the status of the
test target is known, the measurement thus indicates the
impact of the networking module on the availability and
performance of different services supported by the ISS
60. The test targets can be implemented by known
means. The test targets can be regarded as functional
modules in the system 60.
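
Because the test target's own performance is known in advance, a measurement taken through a networking module can be reduced to that module's contribution. The arithmetic below is only a sketch: it assumes that delays are additive (measured time = test target baseline + networking module delay), and the sample values and the function name are hypothetical.

```python
def network_module_delay(measured_ms, target_baseline_ms):
    """Estimate the delay added by the networking module under test.

    Assumes delays add up: measured = target baseline + module delay.
    """
    return max(0.0, measured_ms - target_baseline_ms)


# Hypothetical response times measured through the module, in milliseconds.
samples_ms = [48.0, 52.0, 50.0]
# Assumed response time of the test target itself, known a priori.
baseline_ms = 12.0

estimates = [network_module_delay(s, baseline_ms) for s in samples_ms]
mean_delay_ms = sum(estimates) / len(estimates)
```

With these sample values the module's estimated contribution is 38 ms on average.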

The test targets add extra measurement
routes for testing the modules of the system 60. The test
targets can be connected anywhere in the system 60 to
provide the extra measurement routes. For example,
the test target 72 can be connected between the firewall
63 and the other ISSs 56 to test both the firewall 63 and
the proxy servers 67.

All of the servers and firewalls 62-67 in the ISS 60 
can be implemented by known means. The servers and 
firewalls 62-67 are logical components that may be run 
by one or more host computer systems (not seen in Fig- 
ure 3) within the ISS 60. This means that different types 
of servers and firewalls may be running in one host com- 
puter system or one of the servers or firewalls may be 
run by two or more host computer systems. The router 
61 is both a physical component and a logical
component of the system 60. All of the components of the ISS
60 are referred to as functional modules below.

The ISS 60 has a physical topology and a service 
topology. The physical topology of the ISS 60 means the 
physical interconnections between (1) the remote site
51 and (2) the router 61 and the host systems of the ISS 
60. As described above, the host systems run the
servers and firewall 62-67. The service topology of the
system 60 describes the dependencies between the
functional modules of the system 60. The service topology
of the system 60 also describes the routing of service 
traffic (i.e., service data transfers) through various func- 
tional modules of the ISS 60. In other words, the service 
topology depicts the logical interconnections between 
the different modules. Figure 6 shows the service topol- 
ogy of the ISS 60, including accesses to the global In- 
ternet 55 and the other ISSs 56. It is to be noted that 
there may be other network routing and switching mod- 
ules that are not depicted in the service topology. 

For a given system such as the system 60, the serv- 
ice topology is also known. The service topology of a 
system can be changed by physically reconfiguring the 
host systems of the ISS 60, or by reconfiguring the
functional modules of the system 60.

The users of such a system 60 may be personnel
of an organization (corporation, school, etc.), or residen- 
tial or commercial subscribers accessing the system 60 
from their homes or offices. Using web browser applica- 
tions, users can access Web pages, news, and e-mail 
stored in the web content servers 66 of the system 60. 
In addition, the users can also access Web pages locat- 
ed at remote sites of the global Internet 55 and the other 
ISSs 56. 

The ISS 60 also includes a measurement system 
70 that is responsible for measuring and monitoring the 
status of the ISS 60 and its functional modules. In one 
embodiment and as shown in Figure 3, the measure- 
ment system 70 is connected to the router 61. The 
measurement system 70 can access all of the modules 
51, 54-56, 61-67, and 71-72 of the ISS 60 via the router
61. In other embodiments, the measurement system 70
can be connected to any other module of the ISS 60. In 
those embodiments, the measurement system 70 may 
have a dedicated connection line or channel to each of 
the modules 51, 54-56, 61-67, and 71-72. Alternatively,
the measurement system 70 may include a number of 
components, located at separate areas as appropriate 
to test the service topology of the ISS 60. 

In accordance with one embodiment of the present 
invention, the measurement system 70 performs sever- 
al measurements of the different modules of the ISS 60 
and analyzes the measurement results to determine the 
status (i.e., availability and performance) of an individ- 
ual module as well as the entire system of the ISS 60. 
The status information indicates whether the system is 
properly functioning or not, and whether any individual 
module is a bottleneck module in the system 60 or not. 
The measurement system 70 determines the status of 
the system 60 with a minimal number of measurements 
or through a minimal number of measurement routes. 
The measurement system 70 also identifies or isolates
any problem module or modules in the system 60 by 
analyzing the dependencies between the modules and 
the measurements. 

The measurement system 70 performs a number of 
measurement tests to assess the performance of a sub- 
set of modules of the system 60. Each subset of mod- 
ules tested by a measurement test can be referred to as 
a measurement route. Figure 7 shows a number of 
measurement routes 110 through 116, which will be
described in more detail below.

Referring again to Figure 3, to be able to assess the 
health of the data services provided by the system 60, 
the measurement system 70 uses the service topology 
of the system 60. As described above, the service to- 
pology defines the logical dependencies between the 
modules of the system 60. Based on the pre-specified
service topology, the measurement system 70
determines all the various measurement routes that it
can use to assess the status of the system 60. From this
large set of measurement routes, a minimum set of
measurement routes that must be performed to monitor
the overall status and performance of the services of the
system 60 is selected. The measurement system 70 then performs
additional measurement tests to either determine the
status and performance of a particular module or
diagnose if the particular module is problematic or not. By
correlating the results of the measurement tests with the
dependencies between the subsets of modules involved
in each test, problematic modules within the system 60
can be isolated or identified.

The measurement system 70 also provides
independent visual representations of the test results. This
provides a high level representation of the status and 
usage of various modules of the system 60. Comparison 
of the test results with pre-specified thresholds is used 
to trigger alarms to automatically alert system operators 
of problem conditions for the data services. 
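
The threshold comparison that triggers alarms can be sketched as follows. The metric names, threshold values, and alarm message format are assumptions made for illustration; the text does not specify them.

```python
def check_thresholds(results, thresholds):
    """Return an alarm message for every result that exceeds its threshold."""
    alarms = []
    for name, value in sorted(results.items()):
        limit = thresholds.get(name)
        # Results without a pre-specified threshold never raise an alarm.
        if limit is not None and value > limit:
            alarms.append(f"ALARM {name}: {value} exceeds {limit}")
    return alarms


# Hypothetical test results and pre-specified thresholds (milliseconds).
results = {"web_response_ms": 850, "dns_lookup_ms": 40}
thresholds = {"web_response_ms": 500, "dns_lookup_ms": 100}
alarms = check_thresholds(results, thresholds)
```

Only the web response time exceeds its threshold here, so a single alarm is raised.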

Figure 4 shows the measurements collected by the 
measurement system 70 from the modules of the sys- 
tem 60. As can be seen from Figure 4, the measure- 
ments are collected through monitors 74 through 79 as- 
sociated with one or more modules of the system 50. 
Each monitor is associated with a measurement route. 
The measurements collected from the different monitors 
are then transported to the measurement system 70, us- 
ing a combination of protocols (e.g., FTP, remote pro- 
cedure invocations, the simple network management 
protocol). The choice of the protocol used may be spe- 
cific to each monitor, depending upon its capabilities, as 
well as upon the type of measurements made by the 
monitor, its storage capacity, and the frequency of the 
measurements. 

A variety of technologies can be employed to imple- 
ment each of the monitors 74-79. Figures 5A-5D show 
four different types of monitors. Again, each of the mon- 
itors 74-79 can be implemented by any one type of the 
monitors shown in Figures 5A-5D. 

Figure 5A shows an active internal monitor 81 that
actively executes in the module 80 and gathers
information about the availability and performance of the data
services of the system 60 from the module 80. For
example, a script that executes on a web server's host
machine to track the utilization of the CPU, hard disk, and
memory and to determine queuing of connections in the
web server's TCP stack falls in this category. Here, the
module 80 can be any module of the system 60. The
gathered information is then sent to the measurement
system 70 by the monitor 81. The monitor 81 can be a
software, hardware, or firmware module.
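
A script of the kind mentioned above might look like the following. This is a minimal sketch using only the Python standard library, so it samples disk usage and, where the platform provides it, the system load average. Memory utilization and TCP connection queuing are not covered, and the field names in the snapshot are assumptions.

```python
import os
import shutil
import time


def sample_host(path="."):
    """Take one snapshot of host resource usage for a collector."""
    disk = shutil.disk_usage(path)
    sample = {
        "timestamp": time.time(),
        "disk_used_fraction": disk.used / disk.total if disk.total else 0.0,
    }
    # The load average is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        try:
            sample["load_1min"] = os.getloadavg()[0]
        except OSError:
            pass                        # load average not obtainable here
    return sample


snapshot = sample_host()
```

In a deployment such a snapshot would be taken periodically and forwarded to the measurement system; the transport is left out of the sketch.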

Figure 5B shows an active external monitor 82 that 
is located in the measurement system 70. The monitor 
82 generates test traffic that emulates typical user ac- 
cesses to the services supported by the system 60 and 
based on the responses from the module 80 being test- 
ed, the monitor 82 actively measures the availability and 
performance of the module 80. The tests are active be- 
cause the monitor generates test traffic to and from the 
target module 80 to perform the measurements. The 
monitor 82 is external because it does not reside on the
functional module being tested. The monitor 82 does not
need to be within the measurement system 70. The 
monitor 82 can reside on any server other than the serv- 
er being tested, or it can reside at the test target 72. The 
active external monitors may not be specific to the op- 
erating system of the host machine that supports the 
module 80 or the specific implementation of the module 
80 being tested. An example of an active external mon- 
itor is a monitor that generates a HTTP request for a 
web page stored at a web server 80 and measures the 
total time taken by the server 80 and the interconnect 
network 54 to transmit the requested web page. 
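
As an illustration of the active external monitor just described, the
following is a minimal sketch in Python of a probe that issues one HTTP
request and times the complete transfer. The function name
time_page_fetch and its opener parameter are hypothetical names
introduced here for illustration; they do not come from the described
system.

```python
import time
import urllib.request


def time_page_fetch(url, opener=urllib.request.urlopen, timeout=10.0):
    """Fetch `url` once and return (elapsed_seconds, bytes_received).

    `opener` is a hypothetical injection point so the probe can be
    exercised without a network; it defaults to urllib's urlopen.
    """
    start = time.perf_counter()
    with opener(url, timeout=timeout) as response:
        body = response.read()  # read the whole page so transfer time is included
    return time.perf_counter() - start, len(body)
```

Because the elapsed time is taken around the whole request, it includes
both the server's processing time and the network transfer time, which
is why such a measurement depends on more than one module.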

Figure 5C shows a passive internal monitor 83 that 
is built into the functional module 80 that is being tested. 
In this case, the measurements are collected during the 
normal operation of the module 80. The monitor 83 is 
passive because it measures performance without gen- 
erating additional traffic and without requiring additional 
processes to be executed. The measurements can be collected
via log files or via Management Information Base (MIB)
variables exported using management protocols such
as the Simple Network Management Protocol (SNMP).
Since they rely on instrumentation built into the modules,
the passive internal monitors are specific to the
application, hardware, or system being monitored. An
example of a passive internal monitor is a web server
that during its normal operation measures the time taken
to respond to a web page request and logs this information
in its access log file.

Figure 5D shows a passive external monitor 84 that 
resides outside the module being monitored. This mon- 
itor operates passively because it resides on the path of 
data transmissions from servers to subscriber sites and 
collects measurements by snooping packets transmit- 
ted over the network 54. By analyzing the sequence of 
packets transmitted, their transmission times, and the 
gaps between packet transmissions, the monitor pas- 
sively collects information that is useful for the measure- 
ment of the data services of the system 60. An example 
of a passive external monitor is a LanProbe device that 
tracks the performance of a web server by estimating 
the time interval between transmission of a request from 
a subscriber site and the transmission of the web page 
from the web server 80. Like the passive internal monitor
83, the passive external monitor 84 makes its measure- 
ments by observing real user traffic and without gener- 
ating any additional test traffic. 

As described above, any of the monitors 74-79 in 
the system 60 can fall in any of the above mentioned 
four categories of monitors. The specific monitors to be 
used in a system 60 can be specified to the measurement
system 70 by the system operator. Alternatively,
the measurement system 70 may itself choose a default 
set of monitors to use based on the service topology of 
the system 60. This set of monitors can be modified at 
any time by the system operator. 

Referring to Figures 6 and 7, based on the service
topology of the system 60 and the location and type of
the monitors 74-79, the measurement system 70 determines
the measurement routes 110-116. The measurement
system 70 then determines the dependencies between
(1) the measurement routes 110-116 and (2) the
modules of system 60, the interconnect network 54, the
global Internet 55, and the other ISSs 56.
Figure 7 shows such a dependency graph.
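
One possible in-memory form of such a dependency graph is a mapping
from each measurement route to the set of modules it depends on,
together with the inverse mapping. The sketch below is in Python.
Figure 7 is not reproduced in this text, so the route memberships shown
are assumptions chosen only to be consistent with the routes and
modules named in the description; the identifiers are hypothetical.

```python
from collections import defaultdict

# Hypothetical route -> modules mapping (assumed, not taken from Figure 7).
ROUTE_DEPENDENCIES = {
    "route_112": {"firewall_63", "dns_65", "other_iss_server_56"},
    "route_113": {"firewall_63", "dns_65", "proxy_67", "other_iss_server_56"},
    "route_114": {"firewall_63", "dns_65", "proxy_67", "internet_server_55"},
    "route_115": {"firewall_63", "internet_server_55"},
}


def routes_by_module(route_dependencies):
    """Invert route -> modules into module -> set of covering routes."""
    covering = defaultdict(set)
    for route, modules in route_dependencies.items():
        for module in modules:
            covering[module].add(route)
    return dict(covering)
```

The inverse mapping gives, for each module, the number of dependencies
that the route selection and rating steps operate on.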

The measurement routes span various layers of
protocols, measuring different features of the functional
modules. The measurement routes also depend on the
functional modules being tested. For example, the time
taken by a web content server 66 to respond to a request
for a web page is a measure of a web content server's
status and can be measured by a measurement tool that
emulates the typical behavior of a web browser used by
subscribers. The same tool can be used to measure
proxy servers 67 as well. For proxy servers 67, the
measurement tool can retrieve web pages from
the proxy cache itself. The same tool can test another
measurement route by forcing the proxy server to retrieve
the web page from a remote web server, rather
than from its cache. A third measurement route can get
the data from the web server directly, rather than via the
proxy server.

The measurement routes may also be based on
passive methods. For instance, content and proxy server
residence time for web accesses can be obtained by a
test tool that analyzes measurements made by the content
and proxy servers 66-67 and stored in their access
logs. Different measurement results can be extracted
for requests that are serviced from the proxy server's
cache, for accesses to the global Internet 55, and for
accesses to other ISS sites 56, and used to investigate
problems in the modules of the ISS 60 that can affect
these accesses.

The measurement route of the DHCP servers 64
may include measurements of the rate of address assignment
failures, the response time for an address assignment,
and the number of addresses allocated per
network subnet. The test tool used to obtain the measurements
can be a script that analyzes DHCP log files.

The measure of availability of a DNS server 65 is
the percentage of DNS queries that it is able to successfully
resolve. The performance of the server is the response
time between when a request for name lookup
is issued and when a response is received. Both availability
and performance can be best measured using instrumentation
built into proxy servers 67 that permits
these servers to measure and log response times for
DNS requests. A test tool that analyzes the log files of
proxy servers can provide a passive measurement of
DNS status for real user requests. In addition, the measurement
routes for the DNS servers 65 also include active
measurements of their response to specific queries
using a public domain tool called "dig".

As described above, to assess the status of the networking
modules such as the interconnect network 54,
the router 61, and the firewall 63, test targets 71, 72 that
are specifically installed in the ISS 60 for testing can be
used. Alternatively, in some cases, subscriber sites
themselves can be used as targets to which measurement
tests can be run. One of the advantages of installing
specialized test targets 71, 72 in the system 60 is
that the test target can support different server applications
that enable service test traffic to be targeted to or
generated from the test targets themselves. For instance,
a test target can support a web server and can
respond to test signals sent to it to transmit web pages.
When connected to a firewall 63, such a test target can
be used to assess the performance of web traffic transmitted
through the firewall, without involving a number
of the other components of the ISS 60. Such test targets
may be the only way of testing modules such as a
"socks"-based firewall that only allows web and other
TCP-based traffic to flow through it and does not respond
to other non-TCP traffic.

Another advantage of the test target arises because
some service topologies may not permit
sufficiently diverse measurement routes to be developed
to diagnose or identify problem modules. In such
cases, by placing test targets at strategic places in the
system 60, problems may be isolated effectively and efficiently.
Figure 17 shows an example. As can be seen
from Figure 17, tests can be run to the remote web server
500 of the other ISSs 56 (Figure 3) via (1) the proxy
server 67 and the firewall 63 and (2) the firewall 63
based on the service topology shown in Figure 17.
When measurements from both measurement routes
show problems, there is no direct way to ascertain
whether the problem is because of the remote web server
500 or because of the firewall 63. This means that the
available measurement routes are not sufficient to either
eliminate the firewall 63 as the problem module or positively
identify it as the problem source. With the test target
72 associated with the firewall 63, an additional
measurement route can be established to measure the
impact of the firewall 63.

All of the measurement routes discussed so far are
made to test the impact of an individual module or a set of
modules on the availability and performance of the data
services and are referred to as the surveillance measurements.
Besides these measurements, the measurement
system 70 may enable other measurements to
monitor specific features of the functional modules, in
order (1) to provide detailed information about
a module that has been diagnosed as being a problem
module, and (2) to analyze the usage of each
module and provide advance alerts that can enable a
system operator to do capacity planning. Depending upon
whether they support diagnostic or planning functions,
the additional measurements necessary in a system
60 are referred to as the diagnostic measurements
and the usage measurements. These additional measurements
do not appear in the dependency graph of the
system 60 and are not used to determine the current
status of the system 60.



The diagnostic and usage measurements are usually
specific to the different functional modules. For example,
for modules that support TCP connections such
as web content servers 66, proxy servers 67, and
"socks"-based firewalls 63, the additional measurements
can include measurements of the connection rate
to the modules, made either from the modules' logs or from
SNMP (Simple Network Management Protocol) MIBs
(Management Information Bases) of the modules, or by
executing the netstat command on the modules' host
machines. In addition, measurements of TCP listen
queue length can be obtained using custom tools that
look up the relevant data structures in the kernels of the
modules' host machines. Since excessive queuing in
the TCP listen queues can result in long delays in service,
monitoring of the TCP listen queue lengths can give
an indication of misconfiguration of specific modules of
the ISS 60.

For all the modules of the ISS 60, including content
servers 66, proxy servers 67, firewall 63, DNS servers
65, DHCP servers 64 and routers 61, the CPU, disk, and
memory utilization of the host machines and the individual
modules can be measured, since overutilization of
any of these fundamental resources can result in poor
performance of the services supported by the ISS 60.
These measurements can be obtained using test tools
such as the Unix utilities netstat, vmstat, and iostat.

To assist in capacity planning, the usage measurements
include measurements of network traffic supported
by each of the network interfaces of the various host
machines of the ISS 60. These measurements can be
made by querying the MIB-II agents that are supported
by most host machines. To facilitate more careful planning,
the traffic supported by each of the modules must
be determined, using measurements that are specific to
the modules. For instance, the traffic to the content and
proxy servers 66 and 67 can be measured from their log
files. On the other hand, traffic supported by the DNS
servers 65 may have to be measured by executing a process
on each DNS server host that periodically communicates
with the DNS server using inter-process communication
techniques such as signals to extract usage
information measured by the DNS servers themselves.

To assess the status of the system 60, the measurement
system 70 uses the surveillance measurement
routes alone. To minimize the overheads of the measurements
on the system 60, the measurement system
70 selects a minimum number of measurement routes
that cover all modules of the system 60. Those measurement
routes nevertheless provide essential measurements
for determining the overall status of the system
60. For example, as can be seen from Figure 7, the
measurement route (i.e., the route 113) that retrieves a
web page from a local web server that is located in the
other ISSs 56 provides information about the health of
the proxy servers 67, the DNS servers 65, the firewall
63, and the local web server in the other ISSs 56. If the
measured response time for retrieving web pages from
the local web server in the other ISSs 56 is less than a
pre-specified threshold, the DNS and proxy servers 65
and 67, the local web server, and the firewall 63 are functioning
as expected. If the measured response time is
greater than the pre-specified threshold, then all of the
modules are suspect modules and more measurements
are needed to show the status and performance of the individual
modules, and to isolate the problem module or
modules.

The process of determining the minimum number
of measurement routes is shown in Figures 10A-10D,
which will be described in more detail below. The process
of determining the status and performance of an individual
module is shown in Figures 12A-12C, 13, and
14A-14B, which will be described in more detail below.
The process of isolating the problem module is shown
in Figure 16, which will also be described in more detail
below. Figure 8 shows an example of isolating or identifying
a problem module using three measurement
routes 90 through 92. Figures 9A-9C show the measurements
over time of these three measurement routes.
Figures 9A and 9B show the measurements of the same
two measurement routes, but over different and yet
overlapping time periods. In Figures 9A-9C, the curve
120 indicates the measurements obtained via the measurement
route 90 and the broken line curve 110 indicates
the measurements obtained via the measurement
route 91. In Figure 9C, the solid line curve 130 indicates
the measurements obtained via the measurement route
92.

As can be seen from Figure 8, these three measurement
routes all pass through the firewall 63. The
measurement route 90 measures the status of the other
ISS's web server 56 via the firewall 63 and the proxy
servers 67. The route 91 is only via the firewall 63. The
measurement route 92 measures the response time to
an Internet web server 55 via the firewall 63.
Therefore, if the measurement of any one measurement
route is good, the firewall 63 can be identified as a non-problematic
module. If, on the other hand, one measurement
route (e.g., the measurement route 90) indicates
that the performance of the system 60 is not meeting
the expectation, then measurements can be taken
from the measurement routes 91 and 92 to find out if the
firewall 63 is the bottleneck module. If the measurement
from the measurement route 91 indicates good measurement
results, then the firewall 63 can be identified as
a non-problematic module. If the measurement route 91
still indicates a problem, then the measurement route 92 can
be taken to further determine if the firewall 63 can be
isolated.

As can be seen from Figures 9A-9C, a performance 
problem is indicated between 3:00PM and 5:00PM for 
the proxy servers 67 because the curve 120 is above 
the threshold level and the curve 110 is below the 
threshold level. In this case, because the curve 110 in- 
dicates no problem, the firewall 63 and the other ISS's 
web server 56 are not having problems. This basically
isolates the problem to the proxy servers 67.

Another problem is indicated between 7:00PM and
9:00PM (see Figure 9B). At this time, there are two common
modules (i.e., the other ISS's web server 56 and
the firewall 63) and more measurement routes are needed
to isolate the problem.

To find out if the firewall 63 is performing as expected,
the third measurement route 92 is used. As can be
seen from Figure 9C, the curve 130 indicates problematic
performance or a bottleneck at the same time. This
basically identifies the firewall 63 as the most likely problematic
module. As described above, the process of isolating
the problem module is shown in Figure 16, which
will be described in more detail below.
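
The reasoning applied to Figures 8 and 9A-9C can be sketched in Python
as follows. This is only an illustration of the two worked cases above,
not the process of Figure 16: it assumes a single faulty module, clears
every module that lies on a route with a good measurement, and keeps
the modules common to all routes with bad measurements. The route and
module identifiers are hypothetical labels for the elements of Figure 8.

```python
# Hypothetical labels for the three routes of Figure 8.
FIGURE_8_ROUTES = {
    "route_90": {"firewall_63", "proxy_67", "other_iss_web_server_56"},
    "route_91": {"firewall_63", "other_iss_web_server_56"},
    "route_92": {"firewall_63", "internet_web_server_55"},
}


def isolate_suspects(route_dependencies, route_is_good):
    """Return modules that may be at fault, given {route: True if good}."""
    cleared = set()
    for route, good in route_is_good.items():
        if good:
            # A good end-to-end result clears every module on that route.
            cleared |= route_dependencies[route]
    suspects = None
    for route, good in route_is_good.items():
        if good:
            continue
        remaining = route_dependencies[route] - cleared
        # Single-fault assumption: the culprit lies on every bad route.
        suspects = remaining if suspects is None else suspects & remaining
    return suspects if suspects is not None else set()
```

With route 90 bad and route 91 good, only the proxy servers remain
suspect; with all three routes bad, only the firewall is common to
them, matching the two cases described for Figures 9A-9C.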

Figures 10A-10D show the process of determining
the minimum set of measurement routes necessary to
assess the status of the system 60. The process shown
is a graph theoretic algorithm for dissecting the dependency
graph. In Figures 10A-10D, the dependencies D
are the connection lines between the measurement
routes 110-116 and the modules 54-56 and 63-67 in Figure
7. The status "NOT_SEEN" indicates that the module
has not been processed by the algorithm. The status
"SEEN" indicates that the module has been processed.
The key idea of this algorithm is to dissect the dependency
graph by first starting with the module that has a
minimum number of dependencies. This ensures that
the minimum set of measurement routes can be determined
in the fastest possible manner. Other alternative
approaches can also be used to dissect the dependency
graph. For example, the dissection can begin with the
measurement route that has the maximum number of dependencies
on modules. However, this approach requires
a more elaborate trial and error procedure that
takes longer to execute.
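
The key idea can be sketched in Python as a recursive search that
always branches on the not-yet-seen module with the fewest covering
routes and keeps the smallest resulting route set. This is a sketch
under stated assumptions rather than the flowchart of Figures 10A-10D:
the function name and data layout are hypothetical, ties are broken by
sorted order instead of randomly, and all branches are explored.

```python
def minimum_route_set(route_dependencies):
    """Return a smallest set of routes that covers every module.

    `route_dependencies` maps a route name to the set of modules it tests.
    """
    all_modules = frozenset().union(*route_dependencies.values())

    def covering(module):
        # Routes whose dependency list includes this module.
        return sorted(r for r, mods in route_dependencies.items() if module in mods)

    def search(not_seen, chosen):
        if not not_seen:
            return chosen
        # Start with the module that has the minimum number of dependencies.
        target = min(sorted(not_seen), key=lambda m: len(covering(m)))
        best = None
        for route in covering(target):
            # Selecting a route marks every module it tests as SEEN.
            candidate = search(not_seen - route_dependencies[route], chosen | {route})
            if best is None or len(candidate) < len(best):
                best = candidate
        return best

    return search(all_modules, frozenset())
```

Branching on the least-covered module keeps the number of alternatives
at each step small, which is the motivation given above for starting
with the module that has the minimum number of dependencies.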

Applying the process of Figures 10A-10D to the dependency
graph of Figure 7, the measurement system
70 determines the minimum set of measurement routes
needed for assessing the status of the system 60. The
process is illustrated in Figures 11A and 11B. Because
the DHCP servers 64 and the interconnect network 54
are independent modules, the process of Figures 10A-10D
first includes the measurement routes corresponding
to these modules in the minimum set of measurement
routes for the system 60. Hence, these measurement
routes are not shown in Figures 11A-11B.

As can be seen from Figure 11A, the Internet web
server module 55 is first selected as it has the minimum
number of dependencies (i.e., two). In this case, because
both the proxy server 67 and the other ISS's server
56 also have the minimum number of dependencies
(i.e., two), it is not required that the module 55 be selected.
Figures 11A and 11B show one example that starts with
the module 55.

As can be seen from Figure 11A, one of the dependencies
of the module 55 which leads to the measurement
route 115 is first selected (again randomly). The
result is shown in block 1700 in which the left side shows
the selected modules and the right side shows the routes
and the modules that are left in the dependency graph
of Figure 7. Considering the dependency graph of block
1700 that is left after the first step of the process of Figures
10A-10D, there are two possible choices for the
module with the minimum number of dependencies (i.e., 2).
These modules are the proxy server 67 and the other
ISS's server 56. Figure 11A shows the case when the
process chooses the other ISS's server 56 at random.
Since there are two dependencies for the other ISS's
server in the dependency graph of block 1700, there are
two possible paths from block 1700 that the process
considers. Choosing the measurement route 113 results
in both the other ISS's server 56 and the proxy server
67 being tested. This is shown in block 1701. Since there
are no more modules left, one set of measurement
routes that can test the entire system 60 is the measurement
routes chosen in blocks 1700 and 1701, i.e.,
routes 115 and 113. Another set of measurement routes
is shown in block 1702 that results when route 112 is
chosen to cover the other ISS's server 56. To cover the
remaining module, the proxy server 67, either route 113
or 114 can be chosen (shown by blocks 1704 and 1703
in Figure 11A). In either of the paths leading to blocks
1703 and 1704, three measurement routes (115, 112
and 113 or 114) are needed to test the system 60.

Figure 11B shows an alternate case in which the
process selects the other dependency of the module 55 which
leads to the measurement route 114. The result is
shown in block 1705 in which the left side shows the
selected modules and the right side shows the routes and
the modules left. As can be seen from the block 1705,
the measurement route 115 is determined to be redundant
and eliminated. The blocks 1705 and 1707 show
the following step to select the next measurement route
(which can be either the route 112 or the route 113). In
this case, only two measurement routes (i.e., 112 and
114 or 113 and 114) are needed to measure the status of the
entire system. Considering all of the cases shown in Figures
11A and 11B, the process chooses one of the sets
of measurement routes 114 and 112, 114 and 113, or
115 and 113 as the minimum set of measurement routes
for the system 60.

After the minimum set of measurement routes is
determined and selected, the system 60 can be monitored
through the selected minimum set of measurement
routes. These selected measurement routes provide
the essential measurements that indicate whether
the system 60 is functioning as expected or not. The resulting
measurements can provide real-time indication
of the status of the system 60. The measurements can
be integrated to present a single rating of the status of
the system 60 in totality. The rating can be depicted in
various ways. The display can also show a "high watermark",
indicating the worst rating achieved over a past
period of time.

Because the modules in the system 60 have interdependencies,
the measurements obtained from the selected
minimum set of measurement routes cannot determine
the status of any individual module within the
system 60. This minimum set of measurements only
indicates that one or more modules in the system 60 may
not be performing up to expectation but cannot help
pinpoint the problematic module(s) or provide any indication
of performance trends for the individual modules.
For example, a bottleneck firewall may slow down all
connections it handles, resulting in the minimum set of
measurements indicating that the system 60 has a performance
bottleneck. The true status of the individual
modules cannot be determined from the minimum set
of measurement routes alone. Thus, more measurement
routes that involve a particular module are needed
such that correlation of these measurements can be
made to determine the status and performance of the
particular module. Described below is a rating function
that correlates measurements obtained from various
measurement routes involving the same module to determine
the status of the module.

The rating function can operate in real-time, receiving
the results obtained using different measurement
routes as they become available. The rating function
correlates these results to compute a single rating for
each module. The rating for a module indicates the status
of the module. Ratings are assigned based on
thresholds of performance expectations established by
the system operator. Ratings are assigned on a scale
starting with 0. A rating of 1 indicates that a module's
performance has reached the acceptable threshold of
performance. Any further deterioration in performance
is viewed as a problem (i.e., fault) in the system 60. A
module with a rating less than 1 is not a problem module,
while a module with a rating greater than 1 is a problem
module. The extent of badness of a module is indicated
by the extent to which the rating of the module exceeds
1.

Figures 12A-12C describe the rating function. To
derive the rating for each individual module of the system
60, the rating function uses the dependency graph
for the system 60 shown in Figure 7. Considering a specific
module for which a rating must be derived, in Step
201, the rating function first determines all the measurement
routes in the dependency graph that cover the
module. For example, in Figure 7, the Internet web server
module is covered by two measurement routes 114
and 115. In Step 202, the rating function checks to see
if there is at least one measurement route that covers
the specific module. If not, Step 203 is executed, indicating
that no rating can be assigned to this module, and
the rating function terminates in Step 204.

If at least one measurement route covers the specific
module, in Steps 205 and 206, the rating function
constructs a measurement matrix, the number of columns
and rows of which are equal to the number of
measurement routes covering the specific module. The
measurement matrix is used to compare all the measurement
routes to see if there is any overlap or redundancy
between measurement routes. To be able to compare
all pairs of measurement routes, in Step 206, all
entries in the measurement matrix are marked as
NOT_SEEN to indicate that the entries have not been
processed by the rating function. Step 207 is a decision
making step, when the rating function considers if there
are any more pairs of measurement routes that need to
be processed. If yes, Step 208 is executed to consider
any pair of routes that have not been processed yet. In
Step 209, the dependency list (which refers to the set of
modules covered by a measurement route) of one
measurement route is compared with that of another to
see if there is any overlap or redundancy between the
dependency lists. If there is complete overlap, indicating
that the dependency list of one route is a super-set of
the dependency list of another, Step 211 is executed to
determine if the dependency lists are exactly the same.
If yes, Step 214 is executed. In this step, the correlation
algorithm shown in Figures 14A-14B is applied to synthesize
one correlated measurement from the two measurement
routes being considered. Since the dependency
lists of the measurement routes are the same, the maximum
function is applied during correlation in Step 214.
In Steps 215 and 216, the two measurement routes are
replaced in the dependency graph and the measurement
matrix by the new correlated measurement. Then
Step 207 is repeated to compare all remaining pairs of
measurement routes.

In Step 211, if it is determined that the dependency 
lists are not the same, but one measurement route's de- 
pendency list is a proper sub-set of the other measure- 
ment route's dependency list, the latter measurement 
route is deleted from the dependency graph and the 
measurement matrix in Steps 212 and 213 since this 
route is considered to be a redundant measurement 
route. Then Step 207 is executed. If Step 209 is not true, 
Step 210 is executed to indicate that the current pair of 
measurement routes has been processed. Then Step 
207 is executed again. 

If at Step 207, it is determined that all pairs of measurement
routes have been considered, Step 217 is executed
to apply the correlation function shown in Figures
14A-14B to results obtained using the remaining measurement
routes. In this case, the minimum function is
applied. To the resulting correlated measurement, the
rating algorithm of Figure 13 is applied to compute an
instantaneous rating for the module. Step 220 terminates
the rating function.
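
The pairwise comparison of Steps 207-217 can be condensed into the
following Python sketch, which operates on results that are assumed to
be already normalized by their thresholds. It merges routes with
identical dependency lists using the maximum, deletes any route whose
dependency list is a proper super-set of another's, and applies the
minimum to what remains. The function name and input layout are
hypothetical, and the measurement matrix bookkeeping is omitted.

```python
def correlate_for_module(routes):
    """Combine normalized results of all routes covering one module.

    `routes` maps a route name to (dependency_list, normalized_result).
    Returns None when no route covers the module (Step 203).
    """
    if not routes:
        return None
    # Steps 211-216: identical dependency lists are merged with the maximum.
    merged = {}
    for dependencies, value in routes.values():
        key = frozenset(dependencies)
        merged[key] = max(value, merged.get(key, value))
    # Steps 212-213: a route whose list is a proper super-set of another
    # route's list is redundant for this module and is dropped.
    kept = {
        deps: value
        for deps, value in merged.items()
        if not any(other < deps for other in merged)
    }
    # Step 217: the minimum applies, since one good route clears the module.
    return min(kept.values())
```

The minimum is used in the last step because, as noted for Figure 8, a
single good measurement through a module is enough to show that the
module itself is not the problem.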

Figures 14A-14B illustrate the correlation algorithm 
called during the execution of the rating function. The 
inputs to the correlation algorithm are the identities of 
the measurements being correlated together, the 
threshold values for each of the measurements, the fre- 
quency of each of the measurements, a rating period, 
and the type of correlation necessary. The rating period 
is a duration in time for which the correlation applies. 
The type of correlation necessary means whether the 
minimum or maximum of the correlated values is to be
used to compute the rating.

Before performing any correlation, in Step 452, the
rating interval is computed based on the frequencies of
the measurements specified to the correlation algorithm.
The rating interval denotes the time period in
which at least one result from each of the measurements
is likely to be available to the correlation algorithm.

To perform the correlation, the algorithm uses a sliding
time window to determine measurements that
should be considered. This means when new measurement
results are provided by the monitors, one or more
of the past measurement results may be discarded and
a new rating computed. The time period for which a
measurement result is relevant to the correlation algorithm
is the rating period. In Step 453, the rating interval
is compared with the rating period. If the rating period is
less than the rating interval, Step 454 is executed to ensure
that the rating period is reset to be equal to the
rating interval. If not, the algorithm proceeds to Step
455. A timer is set so that the correlation is performed
once every rating period. Step 456 is executed to wait
for the timer to expire.

In Step 457, the algorithm considers the results
from all the measurements in each rating interval.
For measurements that generate more than one result
in a rating interval, the algorithm considers a statistic of
the results in the rating interval. In Figure 14B, which shows
one embodiment, the use of the maximum statistic is shown.
Other embodiments of this algorithm may use other statistics
such as the minimum, the mean, or the median
of the results. This step ensures that measurements can
be correlated independent of the frequency with which
they generate results. For example, a proxy server 67
measurement may generate results once every 2 minutes,
whereas a DNS server measurement may generate
results once every 4 minutes. This step is referred
to as normalization of frequencies of the measurements.

In Step 458, for each of the measurements, the re- 
sults for each rating interval are divided by the corre- 
sponding threshold values, so that all the results can be 
placed on a common linear scale starting at 0. This step 
allows the correlation of different types of measure- 
ments. For example, the percentage of errors seen by 
a proxy server 67 may be one measurement and the 
response time of the proxy server 67 to web page re- 
trieval requests may be another measurement. This 
step is referred to as normalization of the scales of the 
measurements. 
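
The two normalization steps can be illustrated with a short Python
sketch. It assumes, as a simplification not stated in the description,
that the rating interval is the longest measurement period, so that
every measurement yields at least one result per interval. The function
names, the (minute, value) sample layout, and the sample values are
hypothetical.

```python
def rating_interval(periods_minutes):
    """Assumed rule: the slowest measurement sets the rating interval."""
    return max(periods_minutes)


def normalize(samples, threshold, interval, statistic=max):
    """Reduce (minute, value) samples to one threshold-scaled value per interval.

    Step 457: one statistic (maximum by default) per rating interval.
    Step 458: division by the threshold puts results on a common scale
    on which 1 means the threshold has been reached.
    """
    buckets = {}
    for minute, value in samples:
        buckets.setdefault(minute // interval, []).append(value)
    return [statistic(buckets[index]) / threshold for index in sorted(buckets)]


# Hypothetical data: proxy response times (seconds) every 2 minutes and
# DNS response times (milliseconds) every 4 minutes.
interval = rating_interval([2, 4])
proxy = normalize([(0, 1.0), (2, 3.0), (4, 2.0), (6, 1.0)], threshold=2.0, interval=interval)
dns = normalize([(0, 50.0), (4, 200.0)], threshold=100.0, interval=interval)
```

After both steps the proxy and DNS series have the same length and the
same unitless scale, so they can be combined with the maximum or
minimum function in the final correlation step.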

Step 459 computes the number of rating intervals
in a rating period. Step 460 considers the results in the
last rating interval that must be considered for correlation.

Step 461 is executed to apply the maximum or min- 
imum functions as appropriate to output correlated 
measurement results for the last rating period. 

The rating function of Figures 12A-12C applies a
rating algorithm to the correlated results from the correlation
algorithm of Figures 14A-14B. Figures 13A and
13B show the rating algorithm. In Step 401, the rating
algorithm collects all the results from the correlated
measurements during a rating period. Rather than simply
using a statistic of the collected results such as the median
or the mean, the rating function uses a heuristic to rate
the results in a way that mimics how human operators
tend to perceive the status of the system. To enable this
ranking scheme, in Step 402, the algorithm sets the status
of each of the results to be rated to NOT_SEEN, indicating
that the results have not been processed at this
step. Step 403 is a check to see if all the results have
been processed. If not, at Step 404, one of the results
whose status is NOT_SEEN is chosen. In Step 405, the
result is compared to 1. If the result is less than or equal
to 1, the rating assigned is less than or equal to 1. In this
case, Step 407 is executed. In this step, the rating heuristic
used assigns exponentially increasing ratings to
results as they approach 1, thereby ensuring that the
rating approaches 1 at a rate that is faster than that at
which the result approaches 1. This serves to alert a system
operator in advance, even though the status of the
system 60 has not yet degraded beyond acceptable limits.

At Step 405, if the result is greater than 1, Step 406
is executed. In this case, the rating assigned is greater
than 1 since the result is greater than 1. A function that
logarithmically increases the rating with increase in the
result is used at this step. Again, this step emulates a
system operator's view of the status of the system, because
beyond the acceptance limit, the criticality of the
system 60 as viewed by the system operator does not
increase linearly. The precise logarithmic and exponential
functions used in Steps 406 and 407 are chosen
such that the rating is 1 when the result is 1. Figures 13A
and 13B describe one embodiment of the rating algorithm
in which the logarithmic and exponential functions are
based on a base of 2. Any alternative value can be used
in other embodiments.
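
One pair of base-2 functions that satisfies these constraints is sketched below. This particular form is an assumption made for illustration; the functions actually used are those of Figures 13A and 13B, which are not reproduced in this text. Here r denotes a normalized result and f(r) the rating assigned to it.

```latex
f(r) =
\begin{cases}
2^{r} - 1, & 0 \le r \le 1 \quad \text{(Step 407, exponential)} \\
1 + \log_{2} r, & r > 1 \quad \text{(Step 406, logarithmic)}
\end{cases}
\qquad
f(0) = 0, \quad f(1) = 2^{1} - 1 = 1 + \log_{2} 1 = 1 .
```

With this choice the slope just below r = 1 is 2 ln 2 (about 1.39), so the rating closes in on 1 faster than the result does, while above the limit each doubling of the result adds only 1 to the rating (f(2) = 2, f(4) = 3).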

After Steps 406 and 407, Step 408 is executed to
set the status of the result to SEEN, indicating that the
value has been processed. Then Step 403 is repeated.
If in Step 403 it is determined that all results have been
processed, the algorithm moves to Step 409. In this
step, a statistic of the ratings assigned to each of the
measurement results in the rating period is used as the
aggregate rating. Figure 13B shows one embodiment in
which the average of the ratings is used as the aggregate
rating for the rating period. The aggregate rating is
reported in Step 410 and the algorithm ends in Step 411.
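
Steps 401 to 410 can be sketched in Python as follows. The per-result functions 2**r - 1 and 1 + log2(r) are assumed base-2 forms chosen only so that a result of 1 maps to a rating of 1; the names rate and aggregate_rating are hypothetical.

```python
import math
from statistics import mean

def rate(result):
    """Rate one normalized result (assumed base-2 forms, see lead-in)."""
    if result < 0:
        raise ValueError("normalized results start at 0")
    if result <= 1:                      # Step 405 -> Step 407
        return 2 ** result - 1           # exponential branch, rate(1) == 1
    return 1 + math.log2(result)         # Step 406, logarithmic branch

def aggregate_rating(results):
    """Rate every correlated result of a rating period and average them."""
    if not results:
        raise ValueError("no results collected in this rating period")
    status = ["NOT_SEEN"] * len(results)          # Step 402
    ratings = []
    for index, result in enumerate(results):      # Steps 403 and 404
        ratings.append(rate(result))              # Steps 405 to 407
        status[index] = "SEEN"                    # Step 408
    return mean(ratings)                          # Step 409 (average)

period = [0.0, 1.0, 2.0, 4.0]   # assumed normalized results
print([rate(r) for r in period])   # [0.0, 1.0, 2.0, 3.0]
print(aggregate_rating(period))    # 1.5
```

The average is used here because it is the statistic named for the embodiment of Figure 13B; another statistic could be substituted in the last line of aggregate_rating.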

Figure 15 shows an example of the use of the rat- 
ings for various modules of the system 60 obtained by 
applying the rating function of Figures 12A-12D. As can 
be seen from Figure 15, the rating can be specified on 
a scale of 0 to infinity, with a rating of 0 representing 
perfect idealized behavior and a rating of 1 or more in- 
dicating whether the availability and/or performance of 
an individual module has dropped below expectation. 
The ratings of all the individual modules are then gathered
together to form the radar diagram of Figure 15.
The color of the shaded area indicates whether the system
60 is performing to expectation or not. For instance,
the shaded area can be colored green if all the modules
of the system 60 have ratings below 1. The color can be
changed to red, indicating a problem in the system 60,
if any of the modules has a rating of 1 or more. To give
an indication of trends in performance of the different
modules, even when the rating is below 1, a high watermark
that indicates past performance of the modules is
shown in Figure 15 for each of the modules.
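
The coloring rule and the high watermark can be sketched as follows. The function names, the module labels, and the reading of the high watermark as the highest rating seen so far are assumptions made for this example.

```python
def display_color(ratings):
    """Return 'red' if any module has a rating of 1 or more, else 'green'."""
    return "red" if any(r >= 1 for r in ratings.values()) else "green"

def update_high_watermarks(watermarks, ratings):
    """Keep, per module, the highest rating seen so far (assumed meaning of
    the high watermark)."""
    return {m: max(watermarks.get(m, 0.0), r) for m, r in ratings.items()}

# Assumed ratings for two consecutive rating periods.
period_1 = {"proxy server": 0.7, "firewall": 0.4, "DNS server": 0.2}
period_2 = {"proxy server": 0.5, "firewall": 1.3, "DNS server": 0.2}

marks = update_high_watermarks({}, period_1)
marks = update_high_watermarks(marks, period_2)
print(display_color(period_1))  # green
print(display_color(period_2))  # red
print(marks)  # {'proxy server': 0.7, 'firewall': 1.3, 'DNS server': 0.2}
```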

Figure 15 represents one possible way of using the
ratings to present an integrated view of the status of the
system 60 and its modules. Other alternative visual representations
(e.g., using gauges to represent the status
of each module) are possible.

The measurement system 70 (Figure 3) can be integrated
with network management platforms such as
HP OpenView (sold by Hewlett-Packard Co. of Palo Alto,
CA) and NetManager (sold by Sun Microsystems),
to enable the generation of alarms based on trends in
the performance of any of the modules or the system 60.

This display function of the measurement system
70 provides real-time access to the result of the measurements
of the individual module. This result can also
be made available in real-time to facilitate the diagnosis
process in isolating the problem modules, which will be
described in more detail below.

Figure 16 shows the diagnosis process for isolating
or identifying problematic modules in the system 60
based on the status display of Figure 15. Figure 16
shows the algorithm for diagnosing the problem module
using the dependency graph of Figure 7. Diagnosis is
simple if there is only one module in Figure 15 that has
a rating of 1 or more. In this case, the module with a
rating of 1 or more is the problem module. When several
modules have a rating of one or more, it is unclear
whether all of those modules are problem modules or
whether only a subset of those modules are problem
modules. The complexity in diagnosis occurs because
of the interdependencies between modules. As can be
seen in the example of Figure 15, two modules, namely
the proxy server 67 and the firewall 63, have ratings of
one or more. From the service topology in Figure 7, it is
clear that a bottleneck firewall 63 may slow down all connections
it handles, resulting in an accumulation of
connections at the proxy server 67. Consequently, the
proxy server 67 itself may appear to be slowing down in
its performance. In this case, the degradation in performance
of the proxy server 67 is attributable to the
firewall 63. Hence, the firewall 63 is the real problem
module, although the proxy server 67 also has a rating
of 1 or more.

In order to isolate a module (i.e., identify whether
the module has a problem), the module must have been
measured by at least two different measurement routes.
The algorithm of Figure 16 uses two approaches for isolating
a module. One is referred to as positive isolation
and the other is referred to as negative isolation. The
positive isolation identifies the module as the malfunctioning
module when two measurement routes A and B
can be determined such that (1) the dependency list of
B includes the module under consideration but the dependency
list of A does not, (2) the dependency list of
A is a subset of the dependency list of B, and (3) measurement
route A is performing up to expectation but
measurement route B is not. The negative isolation rules
out the functioning modules. It determines whether a set
of modules involved in a number of measurement routes
can be determined as good. By choosing a number of
measurement routes that rule out a number of modules,
the problem module or modules can be identified.
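
The three conditions of positive isolation can be written as a single predicate. In the sketch below, the route names r1 and r2, their dependency lists, and the function name positive_isolation are hypothetical; only the three conditions themselves come from the text above.

```python
from itertools import permutations

def positive_isolation(module, deps, ok):
    """Return a witness pair (A, B) of routes, or None if no pair exists.

    deps maps each measurement route to the set of modules it depends on;
    ok maps each route to True when it performs up to expectation.
    """
    for a, b in permutations(deps, 2):
        if (module in deps[b] and module not in deps[a]   # condition (1)
                and deps[a] <= deps[b]                    # condition (2)
                and ok[a] and not ok[b]):                 # condition (3)
            return a, b
    return None

# Assumed example: r1 measures the proxy alone, r2 goes through the firewall.
deps = {"r1": {"proxy"}, "r2": {"proxy", "firewall"}}
ok = {"r1": True, "r2": False}

print(positive_isolation("firewall", deps, ok))  # ('r1', 'r2')
print(positive_isolation("proxy", deps, ok))     # None
```

In this example the proxy is also cleared by negative isolation, because route r1, which involves it, is performing up to expectation.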

Figure 16 shows the diagnosis process in detail. Initially,
in Step 501, the status of all modules is marked
as UNKNOWN. In Step 502, all modules are marked as 
NOT_SEEN to indicate that the process has not consid- 
ered these modules as yet. In Step 503, the process 
checks to see if all modules have been processed. If 
not, the process continues to Step 504. In Step 504, one 
of the modules not processed is considered. In Step 
506, negative isolation is applied. If any measurement 
route involving the module is good, Step 507 is executed 
to set the status of the module to GOOD. At the same 
time, in Step 508, all modules whose status is UN- 
KNOWN are marked as NOT_SEEN to indicate that the 
process needs to consider these modules again, to 
check if it can now determine whether these modules 
are GOOD or BAD. If negative isolation is not possible 
in Step 506, i.e., the answer to the comparison is no, 
Step 520 is applied. This step executes the positive iso- 
lation rule to see if a module can be marked as BAD. If 
yes, Step 521 is executed, followed by Step 508. After 
Step 508, the process returns to check for other 
NOT_SEEN modules. If Step 520 does not succeed, the 
status of the module is left as UNKNOWN and the proc- 
ess proceeds to Step 503. 

If at Step 503, it is determined that all modules are 
marked as SEEN, Step 509 is executed to report the
status of all modules. Modules reported BAD at this step 
are the primary problematic modules in the system 60. 
Modules reported UNKNOWN are modules whose sta- 
tus may be affected by the problematic modules of the 
system 60. 
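
The flow of Figure 16, as described above, can be sketched in Python. This is a literal reading of the two isolation rules as stated in the text; the function name diagnose, the route names, and the example dependency lists are assumptions and are not taken from the dependency graph of Figure 7.

```python
def diagnose(deps, ok):
    """deps: route -> set of modules it depends on; ok: route -> bool."""
    modules = set().union(*deps.values())
    status = {m: "UNKNOWN" for m in modules}            # Step 501
    not_seen = set(modules)                             # Step 502

    def negative(m):                                    # Step 506
        # Any good route that involves the module clears it.
        return any(ok[r] for r in deps if m in deps[r])

    def positive(m):                                    # Step 520
        # Good route A, bad route B, deps(A) within deps(B), m only in B.
        return any(
            m in deps[b] and m not in deps[a] and deps[a] <= deps[b]
            and ok[a] and not ok[b]
            for a in deps for b in deps
        )

    while not_seen:                                     # Step 503
        m = not_seen.pop()                              # Step 504
        if negative(m):
            status[m] = "GOOD"                          # Step 507
        elif positive(m):
            status[m] = "BAD"                           # Step 521
        else:
            continue                                    # status stays UNKNOWN
        # Step 508: reconsider every module that is still UNKNOWN.
        not_seen = {x for x in modules if status[x] == "UNKNOWN"}
    return status                                       # Step 509

# Assumed example routes and their measured state.
deps = {"r1": {"proxy"}, "r2": {"proxy", "firewall"}, "r3": {"dns", "firewall"}}
ok = {"r1": True, "r2": False, "r3": False}
result = diagnose(deps, ok)
print(sorted(result.items()))
# [('dns', 'UNKNOWN'), ('firewall', 'BAD'), ('proxy', 'GOOD')]
```

The loop terminates because each pass through Step 508 follows a module leaving the UNKNOWN state for good. In this literal reading neither rule consults the status of other modules, so the re-examination in Step 508 does not change the outcome here; it is kept to mirror the flow of Figure 16.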

In the foregoing specification, the invention has 
been described with reference to specific embodiments 
thereof. It will, however, be evident to those skilled in 
the art that various modifications and changes may be 
made thereto without departing from the broader spirit 
and scope of the invention. The specification and draw- 
ings are, accordingly, to be regarded in an illustrative 
rather than a restrictive sense.

1. A method of determining status of a module among
a plurality of modules (51, 54-56, 61-67) of a data
service system (50), the method comprising:

(A) collecting measurements from at least one
measurement route that involves the module;

(B) analyzing interdependencies of the measurements
to determine the status of the module.

2. The method of claim 1, wherein the step (A) further
comprises

(a) finding all measurement routes (110-116)
that involve the module;

(b) collecting the measurements from all the
measurement routes (110-116) that involve the
module.

3. The method of claim 1 or 2, wherein the step (B)
further comprises the steps of

(I) determining the status of the module if only
one measurement route (110-111, 116) involves
the module;

(II) if two or more measurement routes
(112-115) involve the module, then correlating
the measurements based on their interdependencies.

4. A method of determining if a module among a plurality
of modules (51, 54-56, 61-67) of a data service
system (50) is a problematic module, comprising:

(I) analyzing a number of measurements that
involve the module;

(II) if one of the measurements is good, then
identifying the module as non-problematic;

(III) if one of the measurements that only involves
the module is problematic, then identifying
the module as problematic.

5. The method of claim 4, further comprising the step
of identifying the module as problematic if one of
the measurements that only involves the module
and other known non-problematic modules is problematic,
wherein the step (A) further comprises

(a) determining from a dependency graph
whether the number of measurements involve
other known problematic modules, wherein the
dependency graph shows interdependencies
between the measurements and the modules;

(b) determining which measurement of the
measurements does not involve other known
problematic modules;

(c) identifying that measurement as good.

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system (50) by a minimal number of measurements, 
wherein the data service system (50) has a plurality
of modules (51, 54-56, 61-67), comprising:

(A) determining all possible measurement
routes (110-116) for measuring all of the modules
(51, 54-56, 61-67) based on a predetermined
topology of the system (50);

(B) determining dependencies between the
modules (51, 54-56, 61-67) and the measurement
routes (110-116);

(C) analyzing the dependencies to select a minimal
number of measurement routes that involve
all of the modules (51, 54-56, 61-67) to
determine the status of the system (50).

7. The method of claim 6, wherein the step (B) further
comprises the step of generating a dependency
graph between the modules (51, 54-56, 61-67) and
the measurement routes (110-116).

8. The method of claim 6 or 7, wherein the step (C)
further comprises the steps of

(I) determining the number of measurement
routes each module is involved;

(II) selecting a module that is covered by a minimum
number of measurement routes;

(III) removing the module and one of the measurement
routes that covers the module from the
dependency graph, wherein all other modules
covered by the particular measurement route
are also removed from the dependency graph;

(IV) repeating the steps (II) and (III) until there
is no module in the dependency graph;

(V) calculating a first total number and total set
of the measurement routes removed from the
dependency graph.

9. The method of claim 8, further comprising the step
of repeating the steps (I) through (V) for each of the
dependencies involving the selected module until
the minimal number and set is determined.

10. The method of any of claims 1-9, wherein the measurements
include active measurements and passive
measurements, wherein the modules (51,
54-56, 61-67) can be functional modules or physical
modules and the data service system (50) is a web
service system, wherein the status includes the information
of availability or performance which are
then displayed in the data service system (50).
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